In this notebook, we will test and implement a variety of deep learning algorithms to detect fraud in credit card transactions. Our study is based on a Kaggle dataset: Credit Card Fraud Detection.
All credit card transactions were recorded over two days in September 2013. The dataset is highly unbalanced: 492 frauds out of 284,807 transactions (0.172% of transactions).
The dataset contains 30 variables in total: 28 variables that are the result of a PCA transformation, plus the time and the amount of each transaction. There is no prior knowledge about the original features (due to confidentiality).
Furthermore, the metric recommended by Kaggle for this dataset is the Area Under the Precision-Recall Curve (AUPRC). This metric, together with the F1 score, is used in this study to compare the different algorithms.
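As a quick illustration of both metrics with scikit-learn (toy labels and scores for illustration, not from the dataset):

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

# Toy example: 8 transactions, 2 of them fraudulent (label 1)
y_true = np.array([0, 0, 0, 1, 0, 0, 1, 0])
# Hypothetical anomaly scores produced by some model
y_score = np.array([0.1, 0.7, 0.15, 0.9, 0.3, 0.05, 0.4, 0.25])

# AUPRC is computed directly from the continuous scores
auprc = average_precision_score(y_true, y_score)

# F1 needs hard predictions, e.g. thresholding the scores at 0.5
y_pred = (y_score > 0.5).astype(int)
f1 = f1_score(y_true, y_pred)
print(auprc, f1)
```

Note that AUPRC summarises all possible thresholds at once, while the F1 score depends on the chosen threshold.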
import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns
# Plot libraries
from pylab import rcParams
import matplotlib.pyplot as plt
# Machine Learning tools
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, f1_score, average_precision_score
from sklearn.metrics import recall_score, precision_score
# Keras framework: Deep learning
from keras import backend as K
from keras.models import Model, load_model
from keras.layers import Input, Dense, Lambda, Dropout
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from keras import regularizers, metrics
from scipy.stats import norm
# Import axiolib functions
import axiolib.plot as axioplot
from axiolib.plot import histogram_from_dataframe
axioplot.init_notebook_mode()
%matplotlib inline
# Setting seaborn configuration for the figures
sns.set(style='whitegrid', palette='muted', font_scale=1.5)
rcParams['figure.figsize'] = 14, 8
random_seed = 42
labels = ['Normal','Fraud']
NOTE: A data cleaning process was not necessary for the current dataset because the anonymized variables are the result of a dimensionality reduction process (i.e. PCA), and there are no missing values for the time and amount variables.
data = pd.read_csv("s3://axiods/detection_de_fraude/creditcard.csv")
data.head()
Note that the anonymized PCA variables are already scaled (mean of zero and standard deviation of one).
data.describe()
Checking null or missing values in the dataset:
# To check if there is any missing value in the dataframe
data.isnull().sum()
Here we will search for patterns in the data, starting with the time and amount features.
Looking at the distribution of the time feature:
# Analysing the 'TIME' feature of the dataset
print("Fraud transactions:")
print(data.Time[data.Class==1].describe())
print("Normal transactions:")
print(data.Time[data.Class==0].describe())
Looking at the distribution of the amount feature:
# Analysing the 'AMOUNT' feature of the dataset
print("Fraud transactions:")
print(data.Amount[data.Class==1].describe())
print("Normal transactions:")
print(data.Amount[data.Class==0].describe())
# We need to transform the TIME variable from seconds to hours
data["hour"] = data.Time.map(lambda x: np.ceil(x / 3600))
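As a sanity check of the mapping above (standalone numpy, independent of the dataframe):

```python
import numpy as np

# Same mapping as above: seconds since the first transaction -> hour bucket
to_hour = lambda x: np.ceil(x / 3600)

first = to_hour(0.0)        # the very first transaction
boundary = to_hour(3600.0)  # exactly at the end of the first hour
later = to_hour(7262.0)     # a little past two hours -> third bucket
print(first, boundary, later)
```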
axioplot.histogram_from_dataframe(data[data.Class==1],
label="hour",
maxnbinsx=50,
xaxis_title="Fraudulent transactions",
yaxis_title="Nb transactions by hour")
axioplot.histogram_from_dataframe(data[data.Class==0],
label="hour",
maxnbinsx=50,
xaxis_title="Normal transactions",
yaxis_title="Nb transactions by hour")
An autoencoder is a neural network for unsupervised learning: it is trained to reconstruct its own input, so no fraud labels are needed during training. Here, the model is trained on normal transactions only, so fraudulent transactions should produce a high reconstruction error.
Common loss function to minimize: the squared error between the input and its reconstruction.
NOTE: Other neural networks used in unsupervised learning are Restricted Boltzmann Machines and Sparse Coding models.
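For a single sample, the reconstruction error used throughout this notebook is the mean of the squared differences between the input and its reconstruction; a minimal numpy sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # original input
x_hat = np.array([1.0, 2.5, 2.0])  # hypothetical reconstruction
mse = np.mean(np.power(x - x_hat, 2))
print(mse)  # (0.0 + 0.25 + 1.0) / 3
```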
# Dropping 'Time'
X = data.drop(['Time','Class','hour'],axis=1)
Y = data.Class
# Standardising features
X = StandardScaler().fit_transform(X)
X_normal = X[data.Class == 0]
X_fraud = X[data.Class == 1]
Y_normal = Y[data.Class == 0]
Y_fraud = Y[data.Class == 1]
print("X shape:",np.shape(X))
print("Y shape:",np.shape(Y))
print("X_normal shape:",np.shape(X_normal))
print("X_fraud shape:",np.shape(X_fraud))
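StandardScaler centres each feature to zero mean and unit variance, which matters here because the amount variable is on a very different scale from the PCA components. A quick check on toy values (illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature, four samples
scaled = StandardScaler().fit_transform(toy)
print(scaled.mean(), scaled.std())  # ~0.0 and 1.0
```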
The dataset is split into three different sets: train, validation and test sets:
The test set contains normal transactions and all the fraudulent transactions of the dataset.
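With test_size=0.2 on the normal transactions followed by test_size=0.125 on the remainder (as in the code below), the normal transactions end up split roughly 70% train / 10% validation / 20% test:

```python
# Two-stage split of the normal transactions (fractions only, for illustration)
test_frac = 0.2                        # first split: 20% held out for test
val_frac = (1 - test_frac) * 0.125     # 12.5% of the remaining 80% -> 10%
train_frac = (1 - test_frac) * 0.875   # the rest -> 70%
print(train_frac, val_frac, test_frac)
```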
X_normal_train, X_normal_test, Y_normal_train, Y_normal_test = train_test_split(X_normal,Y_normal,test_size=0.2)
X_fraud_test, X_fraud_val, Y_fraud_test, Y_fraud_val = train_test_split(X_fraud,Y_fraud,test_size=0.5)
# Only normal cases for the VAL SET
X_train, X_val = train_test_split(X_normal_train,test_size=0.125)
#X_test = np.concatenate([X_normal_test,X_fraud_test],axis=0)
X_test = np.concatenate([X_normal_test,X_fraud],axis=0)
Y_test = np.concatenate([Y_normal_test,Y_fraud],axis=0)
print("X fraud test shape:",np.shape(X_fraud_test))
print("X normal test shape:",np.shape(X_normal_test))
print("X train shape:",np.shape(X_train))
print("X val shape:",np.shape(X_val))
print("X test shape",np.shape(X_test))
The autoencoder has an encoder and a decoder; both of them have an architecture based on a Multi-Layer Perceptron with the same number of hidden layers.
We use an L1 activity regularizer in the first layer of the encoder, and a Dropout layer at the end of the encoder.
input_dim = X_train.shape[1]
encoder_l1 = int(input_dim*0.5)
encoder_l2 = int(encoder_l1*0.5)
decoder_l1 = int(encoder_l2*2)
decoder_l2 = input_dim
dropout_prob = 0.2 # Fraction of units to drop (the Keras Dropout rate)
dropout_seed = 10
NOTE: The order of the activation functions in the decoder is not the same as in the encoder.
def get_autoencoder_model():
# Building the model
inputs = Input(shape=(input_dim,))
# ENCODER layers
encoder = Dense(units=encoder_l1, activation="tanh",
activity_regularizer=regularizers.l1(10e-5))(inputs)
encoder = Dense(units=encoder_l2, activation="relu")(encoder)
encoder = Dropout(dropout_prob,seed=dropout_seed)(encoder)
# DECODER layers
decoder = Dense(units=decoder_l1, activation="tanh")(encoder)
decoder = Dense(units=decoder_l2, activation="relu")(decoder)
# Defining the AUTOENCODER
autoencoder = Model(inputs=inputs, outputs=decoder)
# Compiling the model
autoencoder.compile(optimizer='Adam',loss='mean_squared_error',
metrics=['accuracy'])
return autoencoder
def get_autoencoder_model1():
# Building the model
inputs = Input(shape=(input_dim,))
# ENCODER layers
encoder = Dense(units=14, activation="tanh",
activity_regularizer=regularizers.l1(10e-5))(inputs)
encoder = Dense(units=14, activation="relu")(encoder)
encoder = Dense(units=7, activation="relu")(encoder)
encoder = Dropout(dropout_prob,seed=dropout_seed)(encoder)
# DECODER layers
decoder = Dense(units=7, activation="tanh")(encoder)
decoder = Dense(units=14, activation="relu")(decoder)
decoder = Dense(units=29, activation="relu")(decoder)
# Defining the AUTOENCODER
autoencoder = Model(inputs=inputs, outputs=decoder)
# Compiling the model
autoencoder.compile(optimizer='Adam',loss='mean_squared_error',
metrics=['accuracy'])
return autoencoder
There are two different models for the undercomplete autoencoder. The first one has two hidden layers in the encoder and one hidden layer in the decoder (plus the output layer). The second model has one more hidden layer in both the encoder and the decoder.
# TRAINING process
nb_epochs = 500
batch_size = 2048
autoencoder = get_autoencoder_model()
autoencoder1 = get_autoencoder_model1()
print("="*50)
print("FIRST MODEL TRAINING")
print("="*50)
# Autoencoder: Defining checkpoints
bestModelFile = 'autoencoder_5.h5'
checkpoint = ModelCheckpoint(filepath=bestModelFile,verbose=0,
monitor='val_loss',mode='min',
save_best_only=True)
reduce_LR = ReduceLROnPlateau(monitor='val_loss',factor=0.5,
patience=10,verbose=False)
early_stop = EarlyStopping(monitor='val_loss',patience=50,verbose=True)
callbacks = [checkpoint, reduce_LR, early_stop]
history = autoencoder.fit(X_train,X_train,epochs=nb_epochs,
batch_size=batch_size,shuffle=True,
validation_data=(X_val,X_val),verbose=0,
callbacks=callbacks)
print("="*50)
print("SECOND MODEL TRAINING")
print("="*50)
# Autoencoder1: Defining checkpoints
bestModelFile1 = 'autoencoder_6.h5'
checkpoint = ModelCheckpoint(filepath=bestModelFile1,verbose=0,
monitor='val_loss',mode='min',
save_best_only=True)
callbacks = [checkpoint, reduce_LR, early_stop]
history1 = autoencoder1.fit(X_train,X_train,epochs=nb_epochs,
batch_size=batch_size,shuffle=True,
validation_data=(X_val,X_val),verbose=0,
callbacks=callbacks)
Working with the best model only:
First Autoencoder model
autoencoder = load_model(bestModelFile)
autoencoder.summary()
Second Autoencoder model
autoencoder1 = load_model(bestModelFile1)
autoencoder1.summary()
plt.subplot(2,1,1)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('loss')
plt.legend(['train','val'],loc='upper right')
plt.title("First Model")
plt.subplot(2,1,2)
plt.plot(history1.history['loss'])
plt.plot(history1.history['val_loss'])
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','val'],loc='upper right')
plt.title("Second Model")
print("-"*30)
print("FIRST MODEL")
print("Min val_loss:",np.min(history.history['val_loss']))
print("-"*30)
print("SECOND MODEL")
print("Min val_loss:",np.min(history1.history['val_loss']))
We can observe that the reconstruction errors for both the training and validation sets converge within the 500 epochs. However, we only work with the model that generalises best on the validation set.
First, we should see the histogram of the reconstruction error for the train and test sets. This could give us a better idea about the threshold to consider to distinguish normal and fraudulent transactions.
Furthermore, we can check whether the model reconstructs the training set well.
X_train_pred = autoencoder.predict(X_train)
X_test_pred = autoencoder.predict(X_test)
train_mse = np.mean(np.power(X_train - X_train_pred, 2), axis =1)
test_mse = np.mean(np.power(X_test - X_test_pred, 2), axis =1)
max_mse = 100
f, ax = plt.subplots(3,1,sharex=True)
ax[0].hist(train_mse[(train_mse < max_mse)],bins=20)
ax[0].set_yscale('log')
ax[0].set_ylabel('Nb transactions')
ax[0].set_title('Train - Normal transactions')
ax[1].hist(test_mse[(Y_test==0) & (test_mse < max_mse)],bins=20)
ax[1].set_yscale('log')
ax[1].set_ylabel('Nb transactions')
ax[1].set_title('Test - Normal transactions')
ax[2].hist(test_mse[(Y_test==1) & (test_mse < max_mse)],bins=20)
ax[2].set_ylabel('Nb transactions')
ax[2].set_title('Test - Fraud transactions')
ax[2].set_xlabel('Mean Squared Error (MSE)')
print("-"*30)
print("FIRST MODEL")
print("-"*30)
print("Nb samples in test:",len(test_mse))
print("Nb normal samples with MSE less than 10:",len(test_mse[(Y_test==0) & (test_mse<10)]))
print("Nb samples of Fraud transactions:",np.sum(Y_test))
print("Max Error in the test set:",np.max(test_mse))
X_train_pred1 = autoencoder1.predict(X_train)
X_test_pred1 = autoencoder1.predict(X_test)
train_mse1 = np.mean(np.power(X_train - X_train_pred1, 2), axis =1)
test_mse1 = np.mean(np.power(X_test - X_test_pred1, 2), axis =1)
f, ax = plt.subplots(3,1,sharex=True)
ax[0].hist(train_mse1[(train_mse1<max_mse)],bins=20)
ax[0].set_yscale('log')
ax[0].set_ylabel('Nb transactions')
ax[0].set_title('Train - Normal transactions')
ax[1].hist(test_mse1[(Y_test==0) & (test_mse1 < max_mse)],bins=20)
ax[1].set_yscale('log')
ax[1].set_ylabel('Nb transactions')
ax[1].set_title('Test - Normal transactions')
ax[2].hist(test_mse1[(Y_test==1) & (test_mse1 < max_mse)],bins=20)
ax[2].set_ylabel('Nb transactions')
ax[2].set_title('Test - Fraud transactions')
ax[2].set_xlabel('Mean Squared Error (MSE)')
print("-"*30)
print("SECOND MODEL")
print("-"*30)
print("Nb samples in test:",len(test_mse1))
print("Nb normal samples with MSE less than 10:",len(test_mse1[(Y_test==0) & (test_mse1<10)]))
print("Nb samples of Fraud transactions:",np.sum(Y_test))
print("Max Error in the test set:",np.max(test_mse1))
# Setting the threshold from the last figure
min_threshold = 1.
max_threshold = 10.
threshold_step = 0.1
threshold_range = np.arange(min_threshold,max_threshold,threshold_step)
mdl1_recall = []
mdl1_precision = []
mdl1_f1 = []
mdl1_aucpr = []
mdl2_recall = []
mdl2_precision = []
mdl2_f1 = []
mdl2_aucpr = []
for thr in threshold_range:
Y_pred = [1 if e > thr else 0 for e in test_mse]
mdl1_f1.append(f1_score(Y_test,Y_pred))
mdl1_precision.append(precision_score(Y_test,Y_pred))
mdl1_recall.append(recall_score(Y_test,Y_pred))
mdl1_aucpr.append(average_precision_score(Y_test,Y_pred))
Y_pred1 = [1 if e > thr else 0 for e in test_mse1]
mdl2_f1.append(f1_score(Y_test,Y_pred1))
mdl2_precision.append(precision_score(Y_test,Y_pred1))
mdl2_recall.append(recall_score(Y_test,Y_pred1))
mdl2_aucpr.append(average_precision_score(Y_test,Y_pred1))
f, ax = plt.subplots(4,1,sharex=True)
ax[0].plot(threshold_range,mdl1_f1)
ax[0].set_ylabel('F1 SCORE')
ax[0].set_title('FIRST MODEL')
ax[1].plot(threshold_range,mdl1_precision)
ax[1].set_ylabel('PRECISION')
ax[2].plot(threshold_range,mdl1_recall)
ax[2].set_ylabel('RECALL')
ax[3].plot(threshold_range,mdl1_aucpr)
ax[3].set_ylabel('AUCPR')
ax[3].set_xlabel('Threshold')
f, ax = plt.subplots(4,1,sharex=True)
ax[0].plot(threshold_range,mdl2_f1)
ax[0].set_ylabel('F1 SCORE')
ax[0].set_title('SECOND MODEL')
ax[1].plot(threshold_range,mdl2_precision)
ax[1].set_ylabel('PRECISION')
ax[2].plot(threshold_range,mdl2_recall)
ax[2].set_ylabel('RECALL')
ax[3].plot(threshold_range,mdl2_aucpr)
ax[3].set_ylabel('AUCPR')
ax[3].set_xlabel('Threshold')
To choose a threshold for fraud detection, we can set a minimum requirement on the recall metric: among the thresholds that achieve a recall of at least 0.75, we pick the one with the maximum F1 score (and hence a good precision).
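Note that when masking the metric arrays, np.argmax returns an index into the masked array, not into threshold_range; keeping the original indices with np.flatnonzero avoids this pitfall (toy arrays below are illustrative):

```python
import numpy as np

# Hypothetical metric curves from a threshold sweep
thresholds = np.array([1.0, 2.0, 3.0, 4.0])
recalls = np.array([0.95, 0.85, 0.70, 0.60])
f1s = np.array([0.30, 0.50, 0.65, 0.55])

valid = recalls > 0.8  # thresholds meeting the recall floor
# Map the argmax over the masked F1 values back to an original index
best = np.flatnonzero(valid)[np.argmax(f1s[valid])]
print(thresholds[best])  # best F1 among thresholds with recall > 0.8
```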
# FIRST MODEL
mdl1_recall = np.array(mdl1_recall)
mdl1_f1 = np.array(mdl1_f1)
min_recall1 = 0.75
valid1 = mdl1_recall > min_recall1
# np.flatnonzero keeps the original indices, so they match threshold_range
threshold_idx1 = np.flatnonzero(valid1)[np.argmax(mdl1_f1[valid1])]
Y_pred = [1 if e > threshold_range[threshold_idx1] else 0 for e in test_mse]
print("FIRST MODEL")
print("-"*30)
print("Autoencoder - Confusion matrix:")
print(confusion_matrix(Y_test,Y_pred))
print("F1 score:",f1_score(Y_test,Y_pred))
print("Precision score:",precision_score(Y_test,Y_pred))
print("Recall score:",recall_score(Y_test,Y_pred))
print("Average precision score:",average_precision_score(Y_test,Y_pred))
# SECOND MODEL
mdl2_recall = np.array(mdl2_recall)
mdl2_f1 = np.array(mdl2_f1)
min_recall2 = 0.75
valid2 = mdl2_recall > min_recall2
# np.flatnonzero keeps the original indices, so they match threshold_range
threshold_idx2 = np.flatnonzero(valid2)[np.argmax(mdl2_f1[valid2])]
Y_pred1 = [1 if e > threshold_range[threshold_idx2] else 0 for e in test_mse1]
print("-"*30)
print("SECOND MODEL")
print("-"*30)
print("Autoencoder - Confusion matrix:")
print(confusion_matrix(Y_test,Y_pred1))
print("F1 score:",f1_score(Y_test,Y_pred1))
print("Precision score:",precision_score(Y_test,Y_pred1))
print("Recall score:",recall_score(Y_test,Y_pred1))
print("Average precision score:",average_precision_score(Y_test,Y_pred1))
batch_size = 2048
original_dim = 29 #X_train.shape[1]
latent_dim = 7 # Dimension of the latent space
intermediate_dim = 14
epsilon_std = 1.0
def sampling(args):
'''
This function is to sample new similar points from the latent
space.
Source: blog.keras.io
'''
z_mean, z_log_var = args
epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim),
mean=0.0, stddev=epsilon_std)
return z_mean + K.exp(z_log_var/2)*epsilon
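The sampling function above implements the reparameterization trick: z = z_mean + exp(z_log_var / 2) * epsilon with epsilon drawn from N(0, 1), which keeps the sampling step differentiable with respect to the mean and log-variance. The same computation in plain numpy:

```python
import numpy as np

rng = np.random.default_rng(0)
z_mean = np.array([0.5, -1.0])
z_log_var = np.array([0.0, 0.0])  # log-variance 0 -> standard deviation 1
epsilon = rng.standard_normal(2)  # the stochastic part, sampled once

# Reparameterized sample: deterministic in z_mean and z_log_var given epsilon
z = z_mean + np.exp(z_log_var / 2) * epsilon
print(z)
```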
def get_variational_autoencoder_model():
'''
This function creates the Variational Autoencoder model plus
an encoder and a generator models.
Architecture: 1 hidden layer + 1 output layer
Source: blog.keras.io
'''
# ENCODER
x = Input(shape=(original_dim,))
h = Dense(intermediate_dim, activation='relu')(x)
z_mean = Dense(latent_dim)(h)
z_log_var = Dense(latent_dim)(h)
# SAMPLING
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean,z_log_var])
# DECODER
decoder_h = Dense(intermediate_dim, activation='relu')
decoder_mean = Dense(original_dim, activation='sigmoid')
h_decoded = decoder_h(z)
x_decoded_mean = decoder_mean(h_decoded)
# end-to-end Autoencoder
vae = Model(x,x_decoded_mean)
# ENCODER: from inputs to latent space
encoder = Model(x,z_mean)
# DECODER: from latent space to reconstructed inputs
decoder_input = Input(shape=(latent_dim,))
_h_decoded = decoder_h(decoder_input)
_x_decoded_mean = decoder_mean(_h_decoded)
generator = Model(decoder_input,_x_decoded_mean)
def vae_loss(x, x_decoded_mean):
'''
The loss function is the sum of a reconstruction loss and a KL
divergence regularization term.
'''
# Reconstruction loss
# Binary crossentropy reconstruction term, as in the Keras VAE example
xent_loss = original_dim*metrics.binary_crossentropy(x,x_decoded_mean)
# KL divergence loss
kl_loss = -0.5*K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=1)
return K.mean(xent_loss + kl_loss)
# Training model
vae.compile(optimizer='rmsprop',loss=vae_loss)
vae.summary()
return vae
nb_epochs = 500
#vae_model = get_variational_autoencoder_model()
########################
# ENCODER
x = Input(shape=(original_dim,))
h = Dense(intermediate_dim, activation='relu')(x)
z_mean = Dense(latent_dim)(h)
z_log_var = Dense(latent_dim)(h)
# SAMPLING
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean,z_log_var])
# DECODER
decoder_h = Dense(intermediate_dim, activation='relu')
decoder_mean = Dense(original_dim, activation='sigmoid')
h_decoded = decoder_h(z)
x_decoded_mean = decoder_mean(h_decoded)
# end-to-end Autoencoder
vae_model = Model(x,x_decoded_mean)
# ENCODER: from inputs to latent space
encoder = Model(x,z_mean)
# DECODER: from latent space to reconstructed inputs
decoder_input = Input(shape=(latent_dim,))
_h_decoded = decoder_h(decoder_input)
_x_decoded_mean = decoder_mean(_h_decoded)
generator = Model(decoder_input,_x_decoded_mean)
def vae_loss(x, x_decoded_mean):
'''
The loss function is the sum of a reconstruction loss and a KL
divergence regularization term.
'''
# Reconstruction loss
# Binary crossentropy reconstruction term, as in the Keras VAE example
xent_loss = original_dim*metrics.binary_crossentropy(x,x_decoded_mean)
# KL divergence loss
kl_loss = -0.5*K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=1)
return K.mean(xent_loss + kl_loss)
# Training model
vae_model.compile(optimizer='rmsprop',loss=vae_loss)
vae_model.summary()
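The KL term in vae_loss vanishes exactly when the encoder outputs the standard-normal prior (z_mean = 0, z_log_var = 0) and grows as the posterior drifts away from it; a quick numpy check of the same formula:

```python
import numpy as np

def kl_term(z_mean, z_log_var):
    # Same expression as in vae_loss, for a single sample
    return -0.5 * np.sum(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var))

at_prior = kl_term(np.zeros(7), np.zeros(7))  # matches the prior -> no penalty
shifted = kl_term(np.ones(7), np.zeros(7))    # shifted mean -> positive penalty
print(at_prior, shifted)
```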
########################
# Defining checkpoints
vae_bestModelFile = 'vae_autoencoder.h5'
vae_checkpoint = ModelCheckpoint(filepath=vae_bestModelFile,verbose=1,
monitor='val_loss',mode='min',
save_best_only=True)
vae_earlystop = EarlyStopping(monitor='val_loss',patience=20,
verbose=1,mode='min')
vae_history = vae_model.fit(X_train, X_train, shuffle=True,
epochs=nb_epochs, batch_size=batch_size,
validation_data=(X_test,X_test), verbose=0,
callbacks=[vae_checkpoint,vae_earlystop])
# load_model() cannot deserialize the custom vae_loss, so we restore the best weights instead
vae_model.load_weights(vae_bestModelFile)
X_train_pred_vae = vae_model.predict(X_train)
X_test_pred_vae = vae_model.predict(X_test)
train_mse_vae = np.mean(np.power(X_train - X_train_pred_vae, 2), axis =1)
test_mse_vae = np.mean(np.power(X_test - X_test_pred_vae, 2), axis =1)
f, ax = plt.subplots(3,1)
ax[0].hist(train_mse_vae,bins=20)
ax[0].set_yscale('log')
ax[0].set_ylabel('Nb transactions')
ax[0].set_title('Train - Normal transactions')
ax[1].hist(test_mse_vae[Y_test==0],bins=20)
ax[1].set_yscale('log')
ax[1].set_ylabel('Nb transactions')
ax[1].set_title('Test - Normal transactions')
ax[2].hist(test_mse_vae[Y_test==1],bins=20)
ax[2].set_ylabel('Nb transactions')
ax[2].set_title('Test - Fraud transactions')
ax[2].set_xlabel('Mean Squared Error (MSE)')
print("Nb samples in test:",len(test_mse_vae))
print("Nb samples of Fraud transactions:",np.sum(Y_test))
print("Max Error in the test set:",np.max(test_mse_vae))
# Setting the threshold from the last figure
threshold = 4.5
Y_pred_vae = [1 if e > threshold else 0 for e in test_mse_vae]
print("VAE - Confusion matrix:")
print(confusion_matrix(Y_test,Y_pred_vae))
print("F1 score:",f1_score(Y_test,Y_pred_vae))